ABSTRACT


Knowledge graphs (KGs) have played an important role in enhancing the performance of many intelligent
systems. In this paper, we introduce the solution for building a large-scale multi-source knowledge graph from
scratch at Sogou Inc., including its architecture, technical implementation and applications. Unlike previous
works that build knowledge graphs with graph databases, we build the knowledge graph on top of SogouQdb,
a distributed search engine developed by Sogou Web Search Department, which can be easily scaled to
support petabytes of data. As a supplement to the search engine, we also introduce a series of models to
support inference and graph-based querying. Currently, the Sogou knowledge graph, whose data are collected
from 136 different websites and constantly updated, consists of 54 million entities and over 600 million entity
links. We also introduce three applications of the knowledge graph in Sogou Inc.: entity detection and linking,
knowledge-based question answering and knowledge-based dialogue systems. These applications have been
used in Web search products to help users acquire information more efficiently.


1. INTRODUCTION 


A knowledge graph (KG) is a special kind of database that integrates information into an ontology. As
an effective way to store and search knowledge, knowledge graphs have been applied in many intelligent
systems and have drawn a lot of research interest. While many knowledge graphs have been constructed and
published, such as Freebase [1], Wikidata [2], DBpedia [3] and YAGO [4], none of these works could
completely fulfill the application requirements of Sogou Inc. The main challenges are listed below:


Lack of data: Though the biggest published knowledge graph (Wikidata) is reported to contain millions
of entities and billions of triples, most of its data are extracted from Wikipedia and still fall far short of
the requirements of Web search applications such as general-purpose question answering and
recommendation. For example, none of the existing knowledge graphs contains information about the latest
Chinese songs, which can only be obtained from specific websites.


Uncertainty of scalability: None of the existing works explicitly reports its system’s capability to deal
with large-scale data or discusses how the knowledge graph could be expanded on a server cluster. This
problem might not be very important for academic research, since even the biggest knowledge graph’s data
can still be held by a single server with a large hard disk drive. In the case of search engines, the potential
data requirement of a knowledge graph is much larger and using distributed storage is unavoidable.


To solve these challenges, we propose a novel solution for building a large-scale knowledge graph. We
use SogouQdb, a distributed search engine developed by Sogou Web Search Department for internal use,
as the core storage engine to obtain scalability, and develop a series of models that supply inference and
graph-based querying functions, which make the system compatible with other knowledge graph
applications. The inference is conducted on HDFS with Spark, which makes the inference procedure capable
of dealing with big data. The Sogou knowledge graph is built with this solution and has been published to
support online products. Currently, the Sogou knowledge graph consists of 54 million entities and over
600 million entity links. The data are extracted from 136 different websites and constantly updated.


We also introduce three applications of knowledge graphs in Sogou Inc.: entity detection and linking, 
knowledge-based question answering and knowledge-based dialogue systems. These applications have 
been used as an infrastructural service in Web search products to help users find the information they want 
more efficiently. 


The rest of this paper is organized as follows: In Section 2, we introduce related work on widely
known published knowledge graphs. In Section 3, we elaborate our solution for constructing a knowledge
graph from scratch. Section 4 presents the applications of knowledge graphs, especially in Sogou Inc. Finally,
we draw a conclusion in Section 5.


2. RELATED WORK 


While many works about building a domain-specific knowledge graph have been published, we focus 
on works of building large-scale multi-domain knowledge graphs and list the most widely known works in 
this section. 


Freebase [1] was published as an open shared database in 2007 and was shut down in 2016 after all of
its data were transferred to Wikidata. The data of Freebase were collected from Wikipedia
(https://www.wikipedia.org/), NNDB (http://www.nndb.com), Fashion Model Directory and MusicBrainz,
and were also contributed by its users. Freebase has more than 1.9 billion triples, 4,000 types and
7,000 properties [1].


Wikidata [2] was first published by Wikimedia in 2012 and has been publicly maintained since then. The
data of Wikidata, which contain more than 55 million entities, mainly come from its Wikimedia sister projects,
including Wikipedia, Wikivoyage and Wikisource, and from other websites.


DBpedia [3] is a large-scale multilingual knowledge graph whose data were extracted from Wikipedia
and collaboratively edited by the community. The English version of DBpedia contains more than 4.58
million entities, and the DBpedia data in 125 languages contain 38.3 million entities [5, 6, 7].


YAGO [4] is an open-source semantic knowledge graph derived from Wikipedia, WordNet and
GeoNames. YAGO has more than 10 million entities and 120 million facts about these entities.


ConceptNet [5], which originated from the Open Mind Common Sense project launched in 1999, has
grown into an open multilingual knowledge graph. ConceptNet contains more than 8 million
entities and 21 million entity links.


CN-DBpedia [6] is a Chinese KG published in 2017 that specifically focuses on extracting knowledge
from Chinese encyclopedias. CN-DBpedia has more than 16.8 million entities and 223 million entity links.


3. CONSTRUCTION 


An overview of the construction framework of the Sogou knowledge graph is shown in Figure 1. The data
of the Sogou knowledge graph are collected from various websites that allow their data to be downloaded
or crawled, e.g., Wikipedia and SogouBaike. The extracted data are stored in a distributed database in the
form of JSON-LD (JavaScript Object Notation for Linked Data), which is a commonly used concrete RDF
syntax. As an additional way to supply data, we introduce an inference model that infers new relationships
between entities. To search and browse the knowledge graph, we develop a SPARQL query engine that
provides RESTful API services. To support search engine products such as question answering and
recommendation, the knowledge graph data are processed to fit the data form of specific tasks. In
this section, we introduce each part of the construction framework.


Fashion Model Directory: https://www.fashionmodeldirectory.com/
MusicBrainz: https://musicbrainz.org
Freebase (user contributions): https://en.wikipedia.org/wiki/Freebase
Freebase (statistics): https://developers.google.com/freebase/
Wikidata: https://www.wikidata.org/wiki/Wikidata
DBpedia: https://wiki.dbpedia.org/about
YAGO: https://datahub.io/collections/yago
CN-DBpedia: http://kw.fudan.edu.cn/


3.1 Data Extraction


The role of data extraction is to extract data from various inputs into a pre-defined form. Specifically,
the input and output of data extraction are defined as follows:


Input: Data downloaded or crawled from the Internet, e.g., Web pages, or XML or JSON data
downloaded through APIs. While the input data consist mostly of free text, many of them contain structured
information such as images, geo-coordinates, links to external Web pages and disambiguation pages.

Output: Structured data in the form of JSON-LD that record the knowledge extracted
from the input data.


Data extraction operations can be classified into two categories. Structured data extraction only deals
with input data that carry structured information, specifically, data that contain recognizable markup.
Free text extraction detects entities and extracts the property information of specific entities from free text.


Figure 1. Overview of Sogou knowledge graph construction framework. The framework could be divided into
three parts: Data Preparation contains operations including collecting data from various sources, extracting data
from both structured sources and free text and normalizing data; Knowledge Graph Construction contains all models
to build a knowledge graph based on the extracted and normalized data; Application is composed of applications
or services of a knowledge graph. A box with a solid line represents an operation or model to process data while a
box with a dashed line represents the intermediate data.


3.1.1 Structured Data Extraction


As the structured information has recognizable markup, we use a rule-based method to build the extractors.
The extractors first parse the Web page into a unified DOM tree, then find the target information according
to manually written rules and save the extracted data in JSON-LD form. For each website, we build
specialized extractors for its data, so that the data of different websites can be updated independently.
As of March 2019, the Sogou knowledge graph system has 45 websites as data sources and 77 rule-based
extractors.
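
The parse-then-apply-rules flow described above can be sketched as follows. This is a minimal illustration and not the extractors used in the Sogou system: the rule table, the class names in the path expressions and the sample page are all hypothetical, and the page is assumed to be well-formed so that the standard-library XML parser can stand in for a real HTML DOM parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical per-website rules: output property -> path expression in the tree.
RULES = {
    "name": ".//h1[@class='title']",
    "birthDate": ".//span[@class='birth']",
}


def extract(page: str, entity_type: str, rules: dict) -> dict:
    """Parse a well-formed page into a tree and apply hand-written rules."""
    root = ET.fromstring(page)
    record = {"@type": [entity_type]}
    for prop, path in rules.items():
        node = root.find(path)
        if node is not None and node.text:
            record[prop] = node.text.strip()
    return record


# Hypothetical sample page, assumed to be well-formed markup.
PAGE = """<html><body>
<h1 class="title">Dehua Liu</h1>
<span class="birth">1961-09-27</span>
</body></html>"""

record = extract(PAGE, "Person", RULES)
print(json.dumps(record, ensure_ascii=False))
```

Keeping the rules in a per-website table rather than in the parsing code is one way to let each website's extractor be changed without touching the others.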


3.1.2 Free Text Extraction 


The task of free text extraction is composed of a series of sub-tasks, including extracting named entity
mentions from plain text, linking the mentions to entities in the knowledge graph and extracting entities’
properties or the relationships between extracted entities. Since training a model that could deal with all
entity types is quite time consuming, we currently focus on a limited set of entity types: Person
(PER), Geo-political Entity (GPE), Organization (ORG), Facility (FAC) and Location (LOC). For the named entity
recognition and linking tasks, we train a Bi-LSTM-CRF model whose feature and parameter selection
follows the work of [7], which achieved the best performance in the TAC KBP 2017 competition [8]. The training
data are constructed from the SogouBaike and Wikipedia Web pages that contain anchor markup. More details
of the model and the training data can be found in Section 4.1.


3.2 Normalization 


This part normalizes the property values of extracted entities and maps entities’ classes and properties to terms
in the Sogou knowledge graph’s ontology. In addition, the data type of each property is specified, which helps
ensure the quality of the processed data. The input and output of this part are defined as follows:


Input: Output of data extraction: Structured data in the form of JSON-LD. 

Output: Structured data in the form of JSON-LD with normalized property name and property value. 
The type of property value follows the definition of Sogou knowledge graph schema. A simplified 
example is given below: 


{
  "@context": {"@vocab": "http://schema.sogou.com", "kg": "http://kg.sogou.com"},
  "@id": "4962641",
  "@type": ["Person"],
  "name": "Dehua Liu",
  "birthDate": "1961-09-27",
  "hasOccupation": ["Singer", "Actor"],
  "sogouBaikeUrl": "https://baike.sogou.com/v4962641.htm"
}


The schema http://schema.sogou.com used in Sogou knowledge base is compatible with http://schema.org. 
Currently, we maintain only one knowledge graph at http://kg.sogou.com while the framework can support 
more knowledge graphs by setting different values. 
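
As a rough illustration of the normalization step, the sketch below maps source-specific property names to ontology terms and coerces values to the expected types. The mapping table, the accepted date formats and the raw record are hypothetical and only stand in for the Sogou schema definitions, which are not given in this paper.

```python
from datetime import datetime

# Hypothetical mapping from source-specific property names to ontology terms.
PROPERTY_MAP = {"title": "name", "born": "birthDate", "job": "hasOccupation"}
# Hypothetical set of date layouts seen in source data.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def normalize_date(value: str) -> str:
    """Return the date in ISO 8601 form (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def normalize(record: dict) -> dict:
    """Rename properties and coerce values to the types the schema expects."""
    out = {}
    for key, value in record.items():
        key = PROPERTY_MAP.get(key, key)
        if key == "birthDate":
            value = normalize_date(value)
        elif key == "hasOccupation" and isinstance(value, str):
            value = [item.strip() for item in value.split(",")]
        out[key] = value
    return out


raw = {"@type": ["Person"], "title": "Dehua Liu", "born": "1961/09/27", "job": "Singer, Actor"}
normalized = normalize(raw)
print(normalized)
```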


3.3 Merging


The merging section is the entrance to the KG storage, a distributed database that stores the whole
knowledge graph. Any operation that changes the KG database, including adding, updating
or deleting data, has to be transformed into unit operations following a pre-defined interface (including
“add”, “update” and “delete”) in the merging section. All unit operations are logged when executed, and the
logs can be used to roll back to any historical version.


For adding entities, the merging section checks whether the entity already exists in the KG database. If
the entity to be added is found in the database, the existing entity’s property values are updated with the
values of the same properties of the added entity. Otherwise, the entity is added to the database as a new
entity. To distinguish entities with the same name, we develop a heuristic model that also compares the
entities’ property values. For updating and deleting data, the @id property is required and the operation is
applied to the entities with the given ids.
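
The unit operations and the log-based rollback can be sketched with an in-memory store. This is only an illustration under simplifying assumptions: the class `KGStore` and its methods are hypothetical names, existing entities are matched by @id alone (the heuristic comparison of names and property values is omitted), and the log keeps a full copy of the previous entity state.

```python
import copy


class KGStore:
    """In-memory stand-in for the KG database with an operation log."""

    def __init__(self):
        self.entities = {}  # @id -> entity dict
        self.log = []       # (operation, entity id, state before the operation)

    def _record(self, op, entity_id):
        previous = copy.deepcopy(self.entities.get(entity_id))
        self.log.append((op, entity_id, previous))

    def add(self, entity):
        entity_id = entity["@id"]
        self._record("add", entity_id)
        # Existing entity: overwrite the shared properties. Otherwise: insert.
        self.entities.setdefault(entity_id, {}).update(entity)

    def update(self, entity_id, properties):
        if entity_id not in self.entities:
            raise KeyError(entity_id)
        self._record("update", entity_id)
        self.entities[entity_id].update(properties)

    def delete(self, entity_id):
        if entity_id not in self.entities:
            raise KeyError(entity_id)
        self._record("delete", entity_id)
        del self.entities[entity_id]

    def rollback(self, version):
        """Undo operations until only the first `version` log entries remain."""
        while len(self.log) > version:
            _, entity_id, previous = self.log.pop()
            if previous is None:
                self.entities.pop(entity_id, None)
            else:
                self.entities[entity_id] = previous


store = KGStore()
store.add({"@id": "4962641", "name": "Dehua Liu"})
store.add({"@id": "4962641", "birthDate": "1961-09-27"})
store.delete("4962641")
store.rollback(2)  # undo the delete, keep both adds
restored = dict(store.entities["4962641"])
```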


3.4 Inference 


As an additional way to supply data, the inference section infers new relationships between entities based on
the existing relations. For example, when we know A is B’s son, we could infer a new relation that B is A’s
parent. In the construction framework, the inference is conducted on the whole data set dumped from
the KG database, and the inference result is added back to the KG through the merging part. Currently, all of
our inference models are rule-based. While neural network based inference methods (such as TransE and
TransR) can infer more potential relations, the accuracy of their results is not yet good enough
for use in products.
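
A rule of the kind described above can be sketched as an inverse-relation table applied to a set of triples. The relation names and the rule table are hypothetical, and the sketch runs in plain Python on a small in-memory set rather than on Spark over an HDFS dump as in the production framework.

```python
# Hypothetical inverse-relation rules: if (s, r, o) holds, then (o, INVERSE[r], s) holds.
INVERSE_RULES = {"parent": "child", "child": "parent", "spouse": "spouse"}


def infer(triples):
    """Return the triples implied by the inverse rules that are not yet known."""
    known = set(triples)
    inferred = set()
    for subject, relation, obj in known:
        inverse = INVERSE_RULES.get(relation)
        if inverse is None:
            continue
        candidate = (obj, inverse, subject)
        if candidate not in known:
            inferred.add(candidate)
    return inferred


# (A, parent, B) reads "A's parent is B".
triples = {("A", "parent", "B"), ("C", "spouse", "D"), ("A", "name", "a")}
new_triples = infer(triples)
print(sorted(new_triples))
```

In the framework above, the returned triples would then be passed to the merging part as "add" operations rather than written to the store directly.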


3.5 Knowledge Graph Storage 


The Sogou knowledge graph storage is developed on top of SogouQdb which is an open source search
engine. Figure 2 gives an overview of the architecture of the KG storage. SogouQdb is used as a distributed
database to store data and provide search services. The KG Storage Service wraps SogouQdb to provide
storing and querying APIs that are better suited to knowledge graph applications. In practice,
we find that querying requests are far more frequent than storing requests and cost more computation
resources. To reduce cost and improve querying speed, a cache layer is added between the querying API and
the KG storage service.


Figure 2. Overview of Sogou knowledge graph storage architecture.
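
The effect of a cache layer between the querying API and the storage service can be illustrated with a memoized lookup. This is a sketch under stated assumptions: `query_storage` and the in-memory `_STORE` are hypothetical stand-ins for the KG storage service, and the actual caching policy of the Sogou system is not described in this paper.

```python
from functools import lru_cache

# Hypothetical stand-in for the KG storage service; a real deployment would
# send a request to the storage backend here.
_STORE = {("4962641", "name"): "Dehua Liu"}
backend_calls = 0


def query_storage(entity_id: str, prop: str):
    global backend_calls
    backend_calls += 1
    return _STORE.get((entity_id, prop))


@lru_cache(maxsize=10_000)
def cached_query(entity_id: str, prop: str):
    """Serve repeated lookups from memory instead of the storage service."""
    return query_storage(entity_id, prop)


first = cached_query("4962641", "name")
second = cached_query("4962641", "name")  # answered by the cache
```

Because storing requests change the underlying data, a cache of this kind would need to be invalidated after writes, for example with `cached_query.cache_clear()`.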


Compared with graph databases such as Neo4j and OrientDB, which are commonly used for knowledge
graph storage, SogouQdb offers advantages in querying speed, scalability and engineering
optimizations. One disadvantage of SogouQdb is that it does not natively support knowledge graph query
languages such as SPARQL. To solve this problem, we introduce the KG storage service to translate SPARQL
queries into calls to SogouQdb’s APIs. Another disadvantage is that SogouQdb is relatively inefficient for
conducting data inference. To solve this problem, we separate the inference part from the KG storage and
conduct the inference on HDFS using Spark. The data used for inference are dumped from SogouQdb using
Qdb-Hadoop tools.
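
To make the idea of translating SPARQL into search-engine requests concrete, the sketch below handles only a single triple pattern and turns it into a filter over JSON-like documents. Everything here is an assumption for illustration: SogouQdb's API is not public, so the request dictionary, the function names and the in-memory document list are hypothetical, and the query strings are simplified (prefix declarations are omitted).

```python
import re

# Matches queries of the form: SELECT ?v WHERE { subject predicate object }
PATTERN = re.compile(
    r"SELECT\s+\?(\w+)\s+WHERE\s*\{\s*(\S+)\s+(\S+)\s+(\S+)\s*\.?\s*\}",
    re.IGNORECASE,
)


def local_name(term: str) -> str:
    """Strip a prefix such as 'kg:' and surrounding quotes from a term."""
    return term.split(":", 1)[-1].strip('"')


def translate(query: str) -> dict:
    """Turn a single-triple-pattern query into a hypothetical search request."""
    match = PATTERN.search(query)
    if match is None:
        raise ValueError("only single triple patterns are supported in this sketch")
    var, subject, predicate, obj = match.groups()
    prop = local_name(predicate)
    if obj == f"?{var}":      # look up a property of a known entity
        return {"filter": {"@id": local_name(subject)}, "return": prop}
    if subject == f"?{var}":  # find entities by property value
        return {"filter": {prop: local_name(obj)}, "return": "@id"}
    raise ValueError("the selected variable must appear in the pattern")


def execute(request: dict, documents: list) -> list:
    """Evaluate the request against in-memory documents."""
    results = []
    for doc in documents:
        matched = all(
            doc.get(key) == value
            or (isinstance(doc.get(key), list) and value in doc[key])
            for key, value in request["filter"].items()
        )
        if matched and request["return"] in doc:
            results.append(doc[request["return"]])
    return results


DOCS = [{"@id": "4962641", "name": "Dehua Liu", "birthDate": "1961-09-27",
         "hasOccupation": ["Singer", "Actor"]}]

birth_dates = execute(translate("SELECT ?d WHERE { kg:4962641 :birthDate ?d }"), DOCS)
singers = execute(translate('SELECT ?e WHERE { ?e :hasOccupation "Singer" }'), DOCS)
```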


4. APPLICATION 
4.1 Entity Linking 


The entity linking task identifies the character strings that represent entities in natural language
text and maps each of them to a specific entity in the knowledge base. For example, Wiki editors manually add
hyperlinks from phrases representing entities in the text to the corresponding Wikipedia pages. A phrase
with such a Wiki internal hyperlink is called Anchor Text. Traditional entity linking methods are based on
feature engineering. This kind of method calculates the link matching degree through features of the
candidate entity and its context. Features usually include prior information of entities, contextual semantic
features, and features associated with entities. Commonly used models include Ranking SVM [9], CRF [10]
and S-MART [11]. With the development of neural networks, feature learning is gradually replacing the
original methods based on feature engineering. This kind of method calculates the context representation of
the entity phrase and the representation of the candidate entity through a specific neural network. The
matching score is defined as the similarity between the vectors. Entity linking models based on deep
learning include [12, 13, 14, 15]. In addition, knowledge graph embedding is also applied to entity linking
tasks. The vector representation of each entity is learned from a large number of knowledge base triplets
used as training data, so that similar entities have similar vector representations. Methods of vector learning
based on knowledge bases include [12, 16, 17, 18, 19].


The focus of entity linking is to find the correct entity among multiple candidates and eliminate ambiguity.
For example, “Li Na” has multiple possible candidate entities, which may represent a tennis star, pop singer,
football baby of Sogou, or even a movie with the same name. In the absence of context information, it is
difficult to link entities accurately. A well-designed entity linking service needs to consider many factors,
including the prior knowledge of the entity itself, the matching degree between the entity and the phrase,
and the fit between the contexts in which the entity and the phrase are located.


In Sogou, the entity linking problem is treated as a ranking problem. We take into consideration the
entity prior, the similarity between the entity description and the context, and the coherence among
entities in the same paragraph. Based on the knowledge graph of Sogou, we have developed
a set of entity linking APIs, which provide a short text linking service, a long text linking service and a table
linking service. These services link the entities contained in the text to the Sogou Knowledge Graph.
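
The ranking view of entity linking can be sketched as a weighted combination of the three signals named above. This is not the Sogou model: the weights, the prior values, the bag-of-words cosine similarity and the type-based coherence test are all simplifying assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def score(candidate, context, linked_types, weights=(0.3, 0.5, 0.2)):
    """Combine entity prior, description-context similarity and coherence."""
    w_prior, w_context, w_coherence = weights  # hypothetical weights
    # Simplified coherence: does the candidate share a type with entities
    # already linked in the same paragraph?
    coherence = 1.0 if candidate["type"] in linked_types else 0.0
    return (w_prior * candidate["prior"]
            + w_context * cosine(candidate["description"], context)
            + w_coherence * coherence)


def link(candidates, context, linked_types=frozenset()):
    """Return the highest-scoring candidate entity for a mention."""
    return max(candidates, key=lambda c: score(c, context, linked_types))


# Illustrative candidates for the mention "Li Na"; priors are made-up values.
CANDIDATES = [
    {"id": "tennis", "type": "Athlete", "prior": 0.6,
     "description": "Chinese professional tennis player who won the French Open"},
    {"id": "singer", "type": "Singer", "prior": 0.4,
     "description": "Chinese pop singer known for folk songs"},
]

with_context = link(CANDIDATES, "Li Na released a new pop album of folk songs")
without_context = link(CANDIDATES, "")
```

With no context the ranking falls back to the prior, which mirrors the observation above that linking is difficult when context information is absent.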


The short text entity linking service is mainly used for entity linking of query text in our search engine. 
After entity linking, the structural information in the knowledge graph related to the entities is shown to 
the user, along with illustrations and pictures (Figure 3). At the same time, based on the type of entities and 
the relationship between entities in the knowledge graph, recommendations of relevant entities are given 
(Figure 4). These richer results make it quicker for users to obtain what they want and what they are 
interested in. Also, entity linking is the basis for automatic question answering, especially for the task of 
knowledge-based question answering. The existing entities in the question need to be accurately linked, to 
limit the scope of semantic search. 


The long text entity linking service is mainly used for Anchor Text generation in Web pages, as shown in
Figure 5. In order to help readers quickly access the introduction information of entities in Sogou
Encyclopedia’s pages, these entities contain hyperlinks to their own pages, i.e., Anchor Text. Our automated
entity linking service greatly improves the efficiency of manual editing. In addition, the long text entity
linking service is also applied to Sogou’s news feed with personalized recommendations. In combination with
the entity linking process, similar or related entities are expanded in the knowledge graph. Thus we can
provide personalized recommendations, along with highly interpretable reasons.


Figure 3. After query entity linking, entity-related information and pictures are shown in search results.


Figure 5. Long text entity service used for Anchor Text generation.


Table entity linking is also used to generate entity Anchor Text in tables online, such as entities in tables 
of Sogou Encyclopedia. Meanwhile, tables provide rich entity type information, entity relationship 
information, etc. After entity linking, these tables can also supply a large amount of high confidence triplet 
information to our knowledge graph. 


4.2 Knowledge-Based Question Answering 


A knowledge graph usually comes with a descriptive language, such as MQL provided by Freebase,
SPARQL formulated by W3C, and CycL provided by Cyc. However, for ordinary users, this structured query
syntax has a steep learning curve. A knowledge-based question answering system uses natural language
as the interface to provide a friendlier way of querying knowledge. On the one hand, natural language
has very strong expressive power. On the other hand, this method does not require users to receive any
professional training. Due to its broad application prospects, knowledge-based question answering (KBQA)
has become a research hot-spot in both academia and industry.


For question understanding, we focus on the automatic question answering task based on a knowledge
graph. The task is to find one or more corresponding answer entities in the knowledge graph for questions
describing objective facts. For a question that contains only simple semantics, the process of automatic
question answering is equivalent to converting the question into a fact triplet on the knowledge base.
However, the questions asked by people are not always presented in simple forms, and more restrictions
are often added to them. For example, a question may contain multiple entities and types related to the
answer. In complex semantic scenarios, KBQA faces the following challenges: 1) how to find multiple
relationships in a question and combine them into a candidate semantic structure; 2) how to calculate
the matching degree between natural language questions and complex semantic structures.


Commonly used methods are based on semantic parsing or ranking. The method based on semantic
parsing converts the question into a formal query statement of a certain standard knowledge base, i.e.,
it finds the optimal (question, semantic query) pair instead of a simple answer entity. Related work includes
the generation of semantic parsing trees using the Combinatory Categorial Grammar (CCG) [20, 21, 22],
and λ-DCS [23, 24, 25]. Typical application projects include ATIS [26] in the air travel information question
answering system, CLANG [27] in the robot soccer game, GeoQuery [28] in the US geographic knowledge
question answering system, and the open source question answering system SEMPRE [23]. The
ranking method does not need a formal representation of questions, but directly ranks candidate entities or
answers in the knowledge base. This kind of method follows the representation-comparison framework, in
which the traditional feature engineering based methods include [29] and deep learning based methods
include [30, 31, 32].


We have implemented a KBQA system and integrated it into Sogou Search Engine (Figure 6) and Sogou’s
dialogue service. Sogou’s KBQA relies mainly on the combination of manual templates and models. By
using templates, the user’s query is directly converted into a structured KB query. In the model approach, the
entities in the query are first linked to the knowledge graph, and then a subgraph is constructed with the
entity as the center. The nodes and edges in the subgraph are used as candidate paths and answers, which
are ranked to produce the final answer.


Figure 6. KBQA in Sogou Search Engine, which answers the query directly in the first search result.
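
The template route of the KBQA system can be sketched as a list of patterns that map a question to an (entity, property) lookup. The templates, the property names and the tiny in-memory graph are hypothetical; the sketch also skips entity linking by looking the entity up by its surface string, and it does not cover the model-based subgraph ranking.

```python
import re

# Hypothetical manual templates: question pattern -> KG property.
TEMPLATES = [
    (re.compile(r"^how tall is (?P<entity>.+?)\??$", re.IGNORECASE), "height"),
    (re.compile(r"^when was (?P<entity>.+?) born\??$", re.IGNORECASE), "birthDate"),
]

# Tiny in-memory stand-in for the knowledge graph.
KG = {"Andy Lau": {"height": "174 cm"}}


def parse(question: str):
    """Convert a question into an (entity, property) query, or None."""
    for pattern, prop in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            return match.group("entity"), prop
    return None


def answer(question: str, kg: dict):
    """Answer a question through template matching, or return None."""
    parsed = parse(question)
    if parsed is None:
        return None
    entity, prop = parsed
    return kg.get(entity, {}).get(prop)


height = answer("How tall is Andy Lau?", KG)
unmatched = answer("Who directed this movie?", KG)
```

Questions that match no template return None here; in the system described above such questions would be handed to the model approach.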


4.3 Knowledge-Based Dialogue System 


Knowledge-based dialogue is a more natural and friendly knowledge service, which can satisfy users’
needs and complete specific knowledge acquisition tasks through multiple rounds of human-agent
interaction. The latest development in dialogue systems is based on deep learning techniques, using the
encoder-decoder model to train the entire system. Related work includes [33, 34, 35, 36]. Combining an
external knowledge base is a way to bridge the gap between the dialogue system and humans. Using
memory networks, [37, 38] have achieved good results in open domain dialogue. By combining words in
the generation process with common words in the knowledge base, [39] produces natural and correct
answers. [40] uses a Twitter LDA model to get the input topic, and adds the topic information and input
representation to the joint attention module to generate a topic-related response. [41] classifies each
discourse in the conversation into a field and uses it to generate the domain and content of the next
discourse. A dialogue system also needs personality and emotion to appear more human. [42] applies
emotion embedding to the generative model. [43, 44] both consider the user’s information in creating a
more realistic chat bot.


With the large-scale growth of knowledge graph resources and the rapid development of machine
learning models, dialogue systems are gradually moving from restricted domains to open domains. Sogou
Wang Zai Robot is an automatic question-answering robot developed by Sogou, as shown in Figure 7. It
combines Sogou’s knowledge graph, Sogou’s dialogue technology and Sogou’s intelligent voice technology to
provide accurate answers in daily conversations.


Dialogue generation based on knowledge graphs is a key technology in knowledge-based dialogues.
Traditional KBQA returns only the exact answer to each question. For example, when asked “How tall is
Andy Lau?”, the system only returns “174 cm”. However, merely providing this kind of answer is not a
friendly way of interacting. Users prefer to receive “The height of Andy Lau, actor of Hong Kong, China, is
174 cm”. This answer provides more background information related to the answer (for example, actor of
Hong Kong, China). In addition, a complete natural language sentence can better support follow-up
tasks such as answer verification and speech synthesis. In order to generate natural language answers, we
use the encoder-decoder framework. A copy and retrieval mechanism is also introduced for complex
questions that require facts in the knowledge graph. Different types of words are obtained from different
sources by using different semantic unit acquisition methods such as copy, retrieval or prediction. Thus
natural answers are generated for complex questions.


Figure 7. A user is having a dialogue with our conversational assistant, Wang Zai.


Another problem that needs to be solved in the dialogue agent is the consistency of the dialogue, i.e.,
the stability of the agent’s portrait. It also requires the integration of external knowledge, e.g., the personal
information in Table 1. Although the agent is a robot, it needs to have a unified personality. Its gender, age,
native place and hobbies should always be the same. When asked “Where were you born?” or “Are you
from Beijing?”, the answers should always be consistent. We model Sogou Wang Zai’s information and feed
it into an encoder-decoder model as embeddings. Thus when a question is related to personal information,
the model generates responses from the vectors of the identity information, which achieves good consistency.


Table 1. Wang Zai’s personal information for more consistent question answering. 


Profile key     Profile value
Name            Wang Zai
Age             Three
Gender          Boy
Hobbies         Cartoon
Speciality      Piano


5. CONCLUSION 


In this paper, we propose a novel solution that is used in Sogou Inc. for building knowledge graphs on
top of a distributed search engine, specifically, SogouQdb. Our solution supplements SogouQdb with
data inference and a graph-based query engine, which make the solution compatible with commonly used
knowledge graph applications. Besides, benefiting from SogouQdb, the Sogou knowledge graph can be
easily scaled to store petabytes of data. We also introduce three applications of a knowledge graph in Sogou
Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialogue
systems, which have been used in Web search products to make knowledge acquisition more efficient.